Ever since COVID-19 erupted into our world, research institutes and governments have publicly released numerous datasets so that research groups and independent individuals can analyze the data around the coronavirus’s spread. We are facing an unprecedented public health crisis with the coronavirus (COVID-19) outbreak. We believe that data-driven decisions, and people working together for the greater good, are among the better ways to get through this difficult time.
In this blog, we explore the question: how is the world’s news media covering the COVID-19 pandemic? Building on its massive television news narratives dataset, GDELT released a powerful news dataset of the URLs, titles, publication dates, and brief snippets of more than 1.1 million worldwide English-language online news articles mentioning the virus, enabling researchers and journalists to understand the global context of how the outbreak has been covered since November 2019. This dataset has been expanding daily and spans a number of related topics.
A single article on COVID-19 can touch on various topics, such as health, the business implications of the disease, or climate change, or it could simply be a front to propagate fake information. Given the huge number of news articles floating around the web in the wake of COVID-19, it is very difficult to compile and compare them. To analyze what is being discussed during these difficult times, we would first have to collect all the news articles and then annotate them according to their implicit news sub-categories. This motivated us to create an approach that annotates news articles on the coronavirus without any manual intervention. With such a pipeline we not only aim to give researchers, media professionals, and journalists access to similar articles, but also to spare them the overhead of reading and understanding unrelated ones. In doing so, we aim to improve the quality of the grouped articles and of the topics that represent them.
We intend to tackle the huge flow of information, called “information overload”, which makes it harder for users to find related information on COVID-19 on the internet. We address this with an application that lets users find news matching their query or interest effortlessly. We foresee several challenges: determining the subtopic of each article, extracting only the main content of each webpage, and presenting the data to the user. In real-world applications, multi-label classification (MLC), in which objects can be identified by more than one label, has a lot of utility. However, it is costly and tedious to label a dataset manually. An unsupervised learning approach should therefore be considered: cluster similar articles first, then apply topic modelling to multi-label the clusters. We use an unsupervised learning technique (clustering) to group the collection of articles so that articles in the same category are more similar to each other than to those in other groups. Clustering can then help classify the types of structure discovered.
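As a minimal sketch of this idea, assuming a small hypothetical vector `articles` of preprocessed texts (our real pipeline works on the full corpus), grouping articles by word usage with base R’s `kmeans` could look like:

```r
# Toy clustering sketch in base R (stats::kmeans).
# `articles` is a hypothetical vector of preprocessed article texts.
articles <- c("virus cases rise hospital",
              "masks shortage supply masks",
              "virus testing hospital cases",
              "price shortage supply panic")

# Build a document-term matrix of raw term counts.
vocab <- unique(unlist(strsplit(articles, " ")))
dtm <- t(sapply(strsplit(articles, " "),
                function(tokens) table(factor(tokens, levels = vocab))))

# Group the articles into k = 2 clusters of similar word usage.
set.seed(42)
clusters <- kmeans(dtm, centers = 2)$cluster
clusters
```

Topic modelling over the resulting clusters then supplies the (possibly multiple) labels per group.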
We analyze this large set of news articles to make it easier for ordinary readers to filter through the many articles related to the virus and draw their own conclusions. Furthermore, we want to understand the semantic relations between different topics, and finally, analyze keywords to uncover patterns in the news content.
Can we find articles with topics similar to those of a given article?
In order to answer this question, we need to answer the following research questions:
1. What is the most dominant topic in the article?
2. How do we determine the value of K best suited for topic modelling on our dataset?
3. How does the topic model perform with different features, namely Term Frequency–Inverse Document Frequency (TF-IDF) applied to a Bag of Words versus plain Bag of Words (BoW)?
Data source
For our dataset we required news articles about the ongoing coronavirus pandemic. In our search, we came across the GDELT Project, which provides a compilation of URLs and brief snippets of worldwide English-language news coverage mentioning COVID-19. It contains data from the period November 1, 2019 through March 26, 2020. GDELT dataset: http://data.gdeltproject.org/blog/2020-coronavirus-narrative/live_onlinenews/MASTERFILELIST.TXT
Scraping Method
On digging deeper into the dataset, we realized that only snippets of the news articles were included. The snippets were chosen by performing a keyword search for the given terms: Cases, Covid19, Falsehoods, Masks, Panic, Prices, Quarantine, Shortages, SocialDistancing, Testing, and Ventilators, and selecting the paragraph with the term’s first occurrence. In addition to the presence of one of these terms, either the sentence itself or the ones before and after it must also contain the term “Coronavirus” or “Covid-19”, ensuring that the news article is related to the coronavirus.
The GDELT dataset had news articles related to the coronavirus, but a snippet alone would not be sufficient to understand the underlying topic of an article. Hence, we decided to scrape the full articles ourselves using the URLs corresponding to each article in the GDELT dataset.
The dataset comprised several files, each containing articles extracted on a particular day for a particular keyword. Since considering all the articles in every file would be computationally infeasible, we agreed on creating a dataset of around 20,000 records. We expected that the topics discussed during the initial period of the pandemic and in the months that followed had evolved. In order to capture this wide array of topics over the five-month duration, we first downloaded all the files, then extracted a fixed number of records from the files belonging to each keyword, repeating this for every keyword. At the end of the extraction process we had around 20,000 news articles as our final dataset.
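A minimal base-R sketch of the scraping step might look like the function below. The function name is ours for illustration, and in practice a proper HTML parser (e.g. the rvest package) is preferable to regular expressions:

```r
# Hypothetical sketch: fetch one article page and strip its markup.
# A real scraper would use an HTML parser instead of regexes.
fetch_article_text <- function(url) {
  html <- paste(readLines(url, warn = FALSE), collapse = " ")
  # Drop script/style blocks first, then all remaining tags.
  html <- gsub("<script[^>]*>.*?</script>", " ", html, perl = TRUE)
  html <- gsub("<style[^>]*>.*?</style>", " ", html, perl = TRUE)
  text <- gsub("<[^>]+>", " ", html)
  gsub("\\s+", " ", text)  # collapse runs of whitespace
}
```

Applying this over the URL column of the GDELT master file list yields the raw article texts.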
Cleanup
As the content we extracted came from websites, it contained numerous HTML tags and special characters. In the preprocessing stage, we first converted the data to lower case. We then cleaned the data by removing URLs (www, http), punctuation, special characters, and stopwords, and stripped extra whitespace. Once preprocessing was complete, the corpus was ready for analysis.
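The cleanup steps above can be sketched in base R as follows; the stopword list here is a tiny illustrative subset (our pipeline used a full stopword list):

```r
# Base-R sketch of the preprocessing steps described above.
# The stopword list is a small illustrative subset.
stopwords_small <- c("the", "a", "of", "and", "in", "to", "is")

clean_text <- function(x) {
  x <- tolower(x)                                   # lower-case
  x <- gsub("(www|http)\\S*", " ", x)               # drop URLs
  x <- gsub("[[:punct:]]", " ", x)                  # drop punctuation/specials
  tokens <- strsplit(trimws(gsub("\\s+", " ", x)), " ")[[1]]
  paste(tokens[!tokens %in% stopwords_small], collapse = " ")
}

clean_text("Cases of COVID-19 are rising, see http://example.com!")
# → "cases covid 19 are rising see"
```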
Storage
After preprocessing, the dataset was stored in a CSV file and uploaded to Google Drive. Dataset: https://drive.google.com/file/d/1qgQiIIi1yhXBj1jAOVz_2dhNT4C2i6bc/view?usp=sharing
# Bar chart of article counts per label
library(ggplot2)
ggplot(main_df) +
  aes(x = original_label) +
  geom_bar(position = "dodge", fill = "#4292c6") +
  theme_linedraw()
…
Using the Bag of Words model with the term-frequency weighting scheme.
library(wordcloud)
library(RColorBrewer)  # for brewer.pal
wordcloud(words = d_bow$word, freq = d_bow$freq, min.freq = 1,
          max.words = 100, scale = c(4, 0.2), random.order = FALSE,
          rot.per = 0.35, colors = brewer.pal(8, "Dark2"))
Using the Bag of Words model with the TF-IDF weighting scheme.
wordcloud(words = d_tfidf$word, freq = d_tfidf$freq, scale = c(2, 0.1),
          min.freq = 1, max.words = 100, random.order = FALSE,
          rot.per = 0.35, colors = brewer.pal(8, "Dark2"))
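The difference between the two weighting schemes behind the word clouds can be illustrated on a toy corpus in base R (our actual `d_bow` and `d_tfidf` tables were computed from the full corpus with a document-term-matrix package):

```r
# Toy illustration of the two weighting schemes used above.
docs <- list(c("virus", "cases", "virus"),
             c("virus", "masks"),
             c("masks", "prices"))
vocab <- unique(unlist(docs))

# Raw term frequency (Bag of Words): total count of each term.
tf <- sapply(vocab, function(w) sum(unlist(docs) == w))

# TF-IDF: down-weight terms that appear in many documents.
df  <- sapply(vocab, function(w) sum(sapply(docs, function(d) w %in% d)))
idf <- log(length(docs) / df)
tfidf <- tf * idf
round(tfidf, 3)
```

Terms that occur in nearly every document (like “virus” in a coronavirus corpus) are deflated under TF-IDF, letting more discriminative terms dominate the cloud.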
We used the elbow method, along with other metrics, to determine the optimal number of topics for topic modelling.
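The elbow heuristic can be illustrated with k-means within-cluster sum of squares on synthetic data (for LDA one would instead track a score such as perplexity over K; the data here is made up purely for illustration):

```r
# Illustration of the elbow heuristic: total within-cluster sum of
# squares drops as K grows; the "elbow" marks diminishing returns.
set.seed(1)
X <- rbind(matrix(rnorm(50, mean = 0), ncol = 2),
           matrix(rnorm(50, mean = 5), ncol = 2))  # two obvious groups

wss <- sapply(1:6, function(k) kmeans(X, centers = k, nstart = 10)$tot.withinss)
plot(1:6, wss, type = "b", xlab = "K", ylab = "Total within-cluster SS")
```

The curve falls sharply up to the true group count and flattens afterwards; the K at the bend is the candidate value.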
…
…
…
…
library(dplyr)      # for the %>% pipe
library(networkD3)

# Edges: one link per (topic, term) pair, weighted by the term's beta.
links <- data.frame(
  source = top_terms$topic,
  target = top_terms$term,
  value  = top_terms$beta
)
nodes <- data.frame(
  name = c(as.character(links$source),
           as.character(links$target)) %>% unique()
)

# networkD3 expects zero-based node ids rather than names, so map them.
links$IDsource <- match(links$source, nodes$name) - 1
links$IDtarget <- match(links$target, nodes$name) - 1

# Build the Sankey diagram of topics and their top terms.
p <- sankeyNetwork(Links = links, Nodes = nodes,
                   Source = "IDsource", Target = "IDtarget",
                   Value = "value", NodeID = "name",
                   colourScale = JS("d3.scaleOrdinal(d3.schemeCategory20);"),
                   sinksRight = FALSE, fontSize = 16,
                   height = 1400, width = 1200,
                   nodePadding = 8, fontFamily = "arial", unit = "Letter(s)")
p
library(circlize)
chordDiagram(new_v, big.gap = 10, directional = 1,
             direction.type = c("diffHeight", "arrows"),
             link.arr.type = "big.arrow", diffHeight = -mm_h(1),
             grid.col = c("violet", "blue4", "blue", "green", "yellow",
                          "tomato", "red", "cyan4", "deeppink", "cyan3",
                          "chocolate4", "darkslategrey", "darksalmon",
                          "chartreuse", "darkorchid2", "deepskyblue1",
                          "lightcoral", "palegreen4", "paleturquoise2",
                          "palevioletred", "peru", "pink4", "purple2",
                          "sienna1", "skyblue2", "seagreen2", "rosybrown",
                          "plum3", "slateblue2", "orange3", "darkgoldenrod2",
                          "salmon2", "pink2"))
…
…
…
…
…
The metrics below were used to evaluate the model.
…
…
| Name | Email-Id | Matr. No. |
|---|---|---|
| Calida Pereira | calida.pereira@st.ovgu.de | 229945 |
| Chandan Radhakrishna | chandan.radhakrishna@st.ovgu.de | 229746 |
| Nandish Bandi Subbarayappa | nandish.bandi@st.ovgu.de | 229591 |
| Mohit Jaripatke | mohit.jaripatke@st.ovgu.de | 224651 |
| Priyanka Bhargava | priyanka.bhargava@st.ovgu.de | 229675 |